V 


Model-based  Gaussian  and  non-Gaussian  Clustering 


Jeffrey  D.  Banfield 

Department  of  Mathematical  Sciences 
Montana  State  University 
Bozeman,  Montana  59715. 

Adrian  E.  Raficry 

Department  of  Statistics,  GN-22 
University  of  Washington 
Seattle,  Washington  98195. 


ABSTRACT 


l 

! 


I  'r 

■ 


The  classification  maximum  likelihood  approach  is  sufficiently  general  to 
encompass  many  current  clustering  algorithms,  including  those  based  on  the 
sum  of  squares  criterion  and  on  the  criterion  of  Friedman  and  Rubin  (1967). 
However,  as  currently  implemented,  it  does  not  allow  the  specification  of  which 
features  (orientation,  size  and  shape)  are  to  be  common  to  all  clusters  and  which 
may  differ  between  clusters.  Also,  it  is  restricted  to  Gaussian  distributions  and  it 
does  not  allow  for  noise. 


We  propose  ways  of  overcoming  these  limitations.  A  reparameterization  of 
the  covariance  matrix  allows  us  to  specify  that  some  features,  but  not  all,  be  the 
same  for  all  clusters.  A  practical  framework  for  non-Gaussian  clustering  is 
outlined,  and  a  means  of  incorporating  noise  in  the  form  of  a  Poisson  process  is 
described.  An  approximate  Bayesian  method  for  choosing  the  number  of 
clusters  is  given.  ^ 

■"  The  performance  of  the  proposed  methods  is  studied  by  simulation,  with 
encouraging  results.  The  methods  are  applied  to  the  analysis  of  a  data  set  arising 
in  the  study  of  diabetes,  and  the  results  seem  better  than  those  of  previous 
analyses.  .  /».*  Cl  • ' 
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1.  INTRODUCTION 

Cluster  analysis  has  developed  mainly  through  the  invention  and  empirical  investigation  of 
ad-hoc  methods,  in  isolation  from  more  formal  statistical  procedures.  In  recent  years  it  has  been 
found  that  basing  cluster  analysis  on  a  probability  model  can  be  useful  both  for  understanding 
when  existing  methods  are  likely  to  be  successful,  and  for  suggesting  new  methods  (Symons 
1981;  McLachlan  1982;  McLachlan  and  Basford  1988). 

One  such  probability  model  is  that  the  population  of  interest  consists  of  G  different 
subpopulations,  and  that  the  density  of  a  p  -dimensional  observation  x  from  the  k\h 
subpopulation  is  fk(x\Q)  for  some  unknown  vector  of  parameters  8.  Given  observations 

x  =  (xj, .  .  .  ,xn),  we  let  Y=(Yj . yn)T  denote  the  identifying  labels,  where  y,  =  k  if  x,  comes 

from  the  A:  th  subpopulation.  In  the  so-called  classification  maximum  likelihood  procedure,  0  and 
y  are  chosen  so  as  to  maximize  the  likelihood 

i(x;0,r)  =  n/Yl.(x,;0)-  an 

»=i 

Scott  and  Symons  (1971)  have  worked  out  the  solution  when  /*(x;0)  is  multivariate  normal 
with  mean  vector  Hk  and  variance  matrix  Zk,  a  distribution  which  we  denote  by  MVN  (n*,!*). 
When  Zk  =a2I  (k  =  1, . . .  ,G)  this  reduces  to  the  sum  of  squares  criterion  (Gordon,  1981), 
while  when  £*  =L  (k  =  1, . . .  ,G)  it  yields  the  criterion  of  Friedman  and  Rubin  (1967).  For  a 
more  detailed  review  of  these  ideas,  see  Gordon  (1981). 

However,  as  currently  implemented,  the  classification  maximum  likelihood  procedure  has 
several  limitations: 

(1)  It  considers  only  the  restrictive  model  where  the  covariance  matrices  are  constant  across  all 
clusters,  or  the  unparsimonious  model  where  they  are  arbitrary  and  unequal.  The  latter  is 
rarely  used  in  practice,  probably  because  of  difficulties  caused  by  its  veiy  generality  and 
lack  of  parsimony  (Symons  1981).  It  would  seem  desirable  to  have  criteria  based  on 
intermediate  models  which  allow  some  of  the  characteristics  of  the  covariance  matrices  to 
differ  across  clusters.  For  example,  clusters  may  be  elliptical  with  roughly  the  same  size 
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and  shape,  but  oriented  in  different  directions. 

(2)  It  allows  only  for  Gaussian  distributions,  whereas  other  distributions  may  be  more 
appropriate  in  some  situations.  An  example  of  this  arises  frequently  in  unsupervised  pattern 
recognition,  where  edges  may  be  represented  by  points  clustered  uniformly,  rather  than 
normally,  along  a  straight  line. 

(3)  It  does  not,  in  general,  allow  for  noise,  or  data  points  which  do  not  fit  the  prevailing  pattern 
of  clusters.  Indeed,  when  the  covariance  matrices  are  unequal,  each  cluster  must  contain  at 
leastp+1  observations  (Symons,  1981). 

In  this  article,  we  present  a  framework  for  model-based  clustering  which  is  sufficiently 
general  to  overcome  these  limitations.  In  Section  2,  we  develop  maximum  likelihood  criteria  for 
Gaussian  clustering  which  allow  clusters  to  have  different  orientations  or  sizes,  while  preserv  ing 
some  common  features,  such  as  shape.  In  Section  3,  we  present  practical  criteria  for  non- 
Gaussian  clustering,  and  we  extend  the  framework  to  incorporate  Poisson  noise.  In  Section  4, 
we  present  a  model-based  approximate  Bayesian  approach  to  choosing  the  number  of  clusters.  In 
Section  5  we  report  the  results  of  a  Monte  Carlo  study  of  the  methods  presented,  and  in  Section  6 
we  study  their  performance  on  three  data  sets,  of  which  two  are  simulated  and  one  is  real. 

2.  ALLOWING  ORIENTATION  *  VD  SIZE  TO  VARY  BETWEEN  CLL'STERS  IN  THE 
GAUSSIAN  CASE 

When  fk  (x;  0)  is  a  MVN  (n* ,  L* )  density,  the  likelihood  ( 1 . 1 )  has  the  form 

Q 

L (x;  0, y)  =  const,  fj  fj  i Zk  l"%  exp{-'/2(x, -\ik )7  I*-1  (x, -Ji* ) } ,  (2. 1 ) 

*=1  ie£* 

where  Ek  =  [i : y,  =lc }.  The  maximum  likelihood  estimator  of  H*  is  x*  =nkl  £  x,  ,  where  nk  is 

('eEi 

the  number  of  elements  in  Ek.  Substituting  this  into  (2.1)  yields  the  concentrated  log-likelihood 

g 

l  (x;  0,  y)  =  const.  -  xh  £  (triW^E^  +  rtfclogllfc  I }, 

h-\ 


(2.2) 
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where  Wk  is  the  sample  cross-product  matrix  for  the  k  th  cluster,  namely 

Wk  =  I  (xrxk)  (xrxk)r . 
ieF.k 

If  X*  =  o2/  (It  =  1 , . . . ,  G ),  then  the  log-likelihood  (2.2)  is  maximized  by  choosing  y  so  as 

G 

to  minimize  tr(W),  where  W  =  £  Wk.  This  is  the  sum  of  squares  criterion  which  underlies,  for 

example,  Ward’s  (1963)  agglomerative  hierarchical  clustering  method.  If 
Xt=X  (k  ~  1, . . .  ,G),  then  the  log- likelihood  (2.2)  is  maximized  by  choosing  y  so  as  to 
minimize  \W\,  the  criterion  of  Friedman  and  Rubin  (1967).  Finally,  when  the  Zk  are  not 
constrained  in  any  way,  the  likelihood  is  maximized  by  choosing  y  so  as  to  minimize 
G  i  Wk  - 

£  nk  log| - 1 .  This  is  similar  to,  but  not  the  same  as,  equation  (14)  of  Scott  and  Symons 

k= 1  nk 

(1971),  which  we  have  been  unable  to  reproduce  exactly. 

Here  we  develop  new  criteria  which  are  more  general  than  that  of  Friedman  and  Rubin 
(1967),  but  based  on  more  parsimonious  models  than  that  of  Scott  and  Symons  (1971).  They 
allow  some  but  not  all  of  the  features  of  cluster  distributions  (orientation,  size  and  shape)  to  vary 
between  clusters,  while  constraining  others  to  be  the  same.  The  key  to  this  is  a 
reparameterization  of  the  covariance  matrix  I*  in  terms  of  its  eigenvalue  decomposition 

Zk=Dk^kDL  (2-3) 

where  Dk  is  the  matrix  of  eigenvectors  and  A k  is  a  diagonal  matrix  with  the  eigenvalues  of  Lk 
on  the  diagonal.  The  orientation  of  the  principal  components  of  I>k  is  determined  by  Dk,  while 
A*  specifies  the  size  and  shape  of  the  density  contours.  We  write  A.k  =  XkAk,  where  Xk  is  the 
first  eigenvalue  of  Lk,  Ak  =  diag{alt, . . .  ,apk },  and  l=au>(X2*>  •  •  •  >apjt>0.  Thus  Dk 
determines  the  orientation  of  the  k th  cluster,  Xk  its  size,  and  Ak  its  shape.  If  the  <xjk's  are  of 
similar  magnitude,  then  the  £th  cluster  will  tend  to  be  nearly  hyperspherical,  while  if  a2k « 1,  it 
will  be  concentrated  about  a  line,  and  if  =  1  and  CX3*  «  1  it  will  be  concentrated  about  a 
two-dimensional  plane  in  p  -space. 
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This  analysis  suggests  that  the  sum  of  squares  criterion  is  likely  to  be  most  appropriate 
when  the  clusters  are  all  hyperspherical  with  the  same  dispersion  (Symons,  1981).  The  criterion 
of  Friedman  and  Rubin  (1967)  is  likely  to  work  best  when  the  clusters  are  ellipsoidal  with  the 
same  orientation,  size  and  shape.  The  criterion  of  Scott  and  Symons  (1971)  allows  clusters  of 
different  orientations,  sizes  and  shapes,  but  its  very  generality  and  lack  of  parsimony  may  cause 
problems.  For  example,  Symons  (1981)  reported  that  criteria  designed  for  clusters  of  different 
shapes  may  produce  clusters  of  different  shapes  and  sizes  even  when  presented  with 
homogeneous-shaped  clusters  that  are  close  together.  The  criterion  of  Fried  mar.  *nd  Rubin 
(1967)  is  based  on  the  assumption  that  Dk,  Xk  and  Ak  are  the  same  for  each  cluster,  while  the 
criterion  of  Scott  and  Symons  (1971)  assumes  them  all  to  be  different.  By  allowing  some  but  not 
all  of  these  quantities  to  vary  between  clusters,  we  obtain  criteria  that  are  appropriate  for  various 
intermediate  situations. 

Assuming  that  £*  =  XkI  leads  to  a  generalization  of  the  sum  of  squares  criterion.  The  fact 
that  I*  is  a  multiple  of  the  identity  matrix  indicates  that  the  underlying  densities  are  spherical. 
Allowing  Xk  to  vaiy  between  densities  allows  the  sizes  of  the  clusters  to  differ.  The  resulting 
criterion  to  be  minimized  is 

G  Wk 

5>*Iogtr(— ) 
k= 1  nk 

Our  analysis  indicates  that  this  criterion  will  be  most  appropriate  when  the  clusters  are 
hyperspherical,  but  of  different  sizes. 

Next,  we  allow  the  orientations  to  vary  while  keeping  size  and  shape  constant,  by  requiring 
that  Xk  =  A.,  Ak  =A  (k  =  1, . . . ,  G )  where  A  is  known,  and  by  allowing  the  Dk ’s  to  vary  between 

p 

clusters.  By  noting  that  Oa)’  replacing  Dk  and  X  in  (2.2)  with  their  maximum 

j= 1 

likelihood  estimators  and  writing  the  eigenvalue  decomposition  of  Wk  as 

Wk  -  Lk  &k  Lk  (2.4) 

where  Qk  =diag  {d)ljk, . . .  ,(tipk )  and  c 0jk  is  the  y'th  eigenvalue  of  Wk,  we  see  that  the  resulting 
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log-likelihood  is  maximized  by  choosing  y  so  as  to  minimize  S  =  £  Sk,  where  Sk  =  tr(A  1  Qt). 

k= i 

When  p=  2,  this  is  the  criterion  that  underlies  the  clustering  method  of  Murtagh  and  Raliery 
(1984). 

We  now  allow  both  size,  X*,  and  orientation,  Dk,  to  vary  between  clusters,  while  assuming 
that  the  shape  matrix  A  is  constant  across  clusters.  In  this  setting,  the  maximum  likelihood 
estimator  of  y  is  obtained  by  minimizing 

*  c 

S  =  X  nk\og(Sk/nk).  (2.5) 

k= 1 

Table  1  shows  the  relationship  between  the  different  criteria  discussed  in  this  section. 


Table  1 

Constraints  imposed  on  clusters  by  different  criteria 


Criterion 

Origin 

Distribution 

Orientation 

Size 

Shape 

tr(W) 

Ward  (1963) 

Spherical 

None 

Same 

Same 

IWI 

Friedman  and 
Rubin  (1967) 

Ellipsoidal 

Same 

Same 

Same 

S 

Murtagh  and 
Raftery  (1984) 

Ellipsoidal 

Different 

Same 

Same 

I>*log  tr(Wyri*) 

This  article 

Spherical 

None 

Different 

Same 

s* 

This  article 

Ellipsoidal 

Different 

Different 

Same 

G  1  wk  1 

k= 1  nk 

Scott  and 
Symons  (1981) 

Ellipsoidal 

Different 

Different 

Different 

It  is  usually  not  feasible  to  find  the  global  minimum  by  evaluating  the  criterion  for  all 
possible  partitions  of  the  observations.  Many  algorithms  have  been  devised  for  finding  local 
minima  or  good  sub-optimal  solutions,  particularly  for  the  sum  of  squares  criterion.  These 
involve  agglomeration,  iterative  relocation  or  other  methods;  for  reviews  see  Gordon  (1981, 


1987),  Murtagh  (1985)  and  Jain  and  Dubes  (1988).  Algorithms  developed  for  the  sum  of 
squares  criterion  can  be  adapted  for  use  with  the  other  criteria  in  Table  1 .  For  example,  Murtagh 
and  Raftery  (1984)  showed  how  Ward’s  (1963)  agglomerative  hierarchical  method  based  on  the 
sum  of  squares  criterion  can  be  generalized  for  use  with  the  criterion  S . 

3.  NON-GAUSSIAN  CLUSTERING  AND  NOISE 

3.1  Non-Gaussian  clustering:  The  uniform-normal  case 

The  model  (1.1)  is  general  enough  to  encompass  clusters  with  non-Gaussian  distributions. 
To  date,  attention  has  been  focused  on  the  multivariate  normal  distribution  because  it  leads  to 
relatively  simple  criteria.  Here  we  suggest  practical  criteria  for  some  non-Gaussian  situations. 

The  basic  idea  is  the  use  of  a  local  parameterization.  We  assume  that  there  are  matrices 
Dk  (k  =  1, . . .  ,G)  such  that  if  z,  =Dy  (x,-^.),  then  z,  has  the  density  gy  (zt  ;0);  often  these 
densities  will  be  the  same,  perhaps  modulo  a  scale  parameter.  In  this  general  framework,  criteria 
can  be  derived  by  maximizing  the  likelihood,  as  in  Section  2.  When  the  distribution  of  x(  is 
MVNOlfl.Lfl),  and  Dk  is  defined  by  (2.3),  then  z(-  is  the  value  of  x,  in  the  local  coordinate 
system  with  origin  at  (iy  and  axes  along  the  principal  components  of  Ly,. 

We  now  carry  out  a  more  detailed  analysis  of  one  specific,  but  important,  non-Gaussian 
situation.  This  is  when  observations  are  clustered  uniformly  along  and  tightly  about  a  line 
segment  in  p  -space.  Such  situations  arise  in  ecology  when  the  data  include  the  geographic 
locations  of  plants  or  animals  which  may  be  clustered  about  roughly  linear  natural  features  such 
as  rivers  or  valleys.  They  also  arise  in  unsupervised  pattern  recognition,  where  observations  may 
be  edge  elements  in  an  image,  or  data  points  in  a  point  pattern  with  a  linear  feature. 

We  let  Ui-zix,  and  v,  =(vn, . . . , v,  =  (z,-2, . . .  ,zip)T .  We  assume  that  ut  is 

uniformly  distributed  between  0y; j  and  0y.2,  and  that  v,  ~MVN(0, Lk).  Let  0*  anc* 

£ *  =  a*2 1;  typically  c*  will  be  small  compared  to  0* . 
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We  now  derive  an  approximate  maximum  likelihood  estimator  for  y  under  this  model.  We 
estimate  Dk  by  Dk  =Lk,  where  Lk  is  defined  by  (2.4),  and  we  estimate  ji*  by  =xt.  We  then 
define  /.,  =  D (x,  -x^ ),  with  corresponding  definitions  for  u,  and  v(  .  Conditionally  on  these 
estimated  values  of  Dk  and  ji*,  the  log-likelihood  for  0  =  (0J, . . . ,  <|>c  )7 ,  ok  and  y  is 

->  G  ■>  1  r 

/  (x;  0,  c r,  y)  =  const.  -  £  { nk  log0*  +  Vi(p  -1  )nk  logo/  +  — -  £  v/  v, } .  (3.1) 

k=l  2o/  ('e£4 

If  we  assume  that  o/=o2  (£  =  1, . . .  ,G),  then  the  log-likelihood  in  equation  (3.1)  is  maximized 
by 

0*  =  max  [tii )  -  min }, 
i  g  Ej  i  e  Et 

62={n(p-l)r1  £*/>,-. 

1=1 

We  therefore  choose  y  so  as  to  minimize  the  criterion 

^(p-l^logtr  +  2  n*log<j>*.  (3.2) 

*=i 

In  the  situation  where  the  ok  s  are  not  constant  across  clusters  we  obtain 

ieEt 

and  y  is  chosen  so  as  to  minimize  the  criterion 

U  =  X  {1^(p-l)«fclog6/  +  ^log0(fc).  (3.3) 

*=i 

Many  variations  on  this  "uniform-normal"  theme  are  possible,  and  lead  to  simple  criteria. 
For  example,  clusters  may  be  distributed  tightly  about  a  two-dimensional  planar  region  in  p  - 
space;  this  can  be  represented  by  specifying  the  distribution  of  (ztl,zi2)  to  be  concentrated  on 
such  a  region.  Also,  the  distribution  of  the  scatter  about  the  main  line  segment  or  planar  region 
may  be  more  general  than  assumed  above,  leading,  for  examp'°  to  a  range  of  values  for  the 
covariance  matrix  of  v, ,  such  as  those  considered  in  Section  2. 
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3.2  Allowing  for  noise 

So  far,  we  have  assumed  that  each  observation  belongs  to  a  cluster.  However,  even  when  a 

data  set  is  made  up  mainly  of  clusters  of  the  prescribed  type,  there  may  be  other  data  points  that 

do  not  follow  this  pattern.  We  allow  for  this  possibility  by  extending  our  model  to  include  such 

G 

observations,  assumed  to  arise  from  a  Poisson  process  with  intensity  v.  Let  E  =  kJ£*  and 

*=i 

G 

no  =  n~  Then  the  likelihood  (1.1)  is  modified  as  follows: 

k= t 

L(x;e,v,y)  =  vn°e~vA  n/Y,(xi>0)>  (3.4) 

<e£ 

where  A  is  the  hypervolume  of  the  region  from  which  the  data  have  been  drawn.  The  clustering 
criteria  developed  so  far  can  easily  be  modified  so  as  to  be  based  on  (3.4).  Taking  account  of 
noise  in  this  way  facilitates  our  proposed  method  for  choosing  the  number  of  clusters,  described 
in  Section  4. 

4.  CHOOSING  THE  NUMBER  OF  CLUSTERS:  AN  APPROXIMATE  BAYESIAN 
APPROACH 

Here  we  suggest  an  approximate  Bayesian  approach  to  the  choice  of  the  number  of  clusters. 
We  first  write  down  an  exact  Bayesian  solution,  but  this  usually  cannot  be  computed  in  a 
reasonable  amount  of  time.  Arguing  heuristically,  we  obtain  an  approximation  to  the  Bayesian 
solution  which  seems  to  work  well  in  numerical  experiments,  some  of  which  are  reported  in 
Section  6. 

We  view  the  problem  of  estimating  the  number  of  clusters  as  one  of  choosing  between 
competing  models  for  the  same  data.  The  exact  Bayesian  solution  consists  of  finding  the 
posterior  probability  p(G lx)  of  each  number  of  clusters  G  given  the  data  x.  This  approach 
seems  to  have  advantages  over  the  alternative  of  hypothesis  testing  in  the  general  context  of 
model  comparison,  as  it  avoids  the  problems  of  multiple  comparisons,  comparing  non-nested 
models,  and  the  tendency  of  hypothesis  tests  to  select  unparsimonious  models  when  the  sample 
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size  is  large  (Berger  and  Sellke  1987;  Raftery  1986b,  1988b).  The  details  have  been  worked  out 
for  many  statistical  problems,  including  the  general  linear  model  (Smith  and  Spiegelhaltcr 
1980),  generalized  linear  models  (Raftery  1986a,  1988b),  change-points  and  point  processes 
(Akman  and  Raftery  1986;  Raftery  and  Akman  1986),  and  software  reliability  (Raftery  1987, 
1988a). 

Technically,  it  is  easiest  to  start  with  the  Bayes  factor,  or  ratio  of  posterior  to  prior  odds  for 
G  -r  against  G  =$ ,  defined  by 

Brs  =  p  (x  I  G  =r )  Ip  (x  I  G  =s ).  (4.1) 


In  (4. 1 ),  p  (x  I  G  -r )  is  the  marginal  likelihood 

/?(xlG=r)  =  £  J|L(x;0,v,Y)/7(0,v,Y)d0dv, 
Ter, 


where  T,  =  {0,1, .  . . , r }",  L(x;0,v, y)  is  the  generalized  likelihood  defined  by  (3.3),  and 
/?(0,  v,Y)  is  the  joint  prior  density  of  0,  v,  and  Y-  When  Y;=0,  x,  belongs  to  the  "noise"  and 
appears  in  the  Poisson  part  of  the  likelihood  (3.3).  A  different  but  related  approach  is  described 
by  Rissanen  (1988). 

Here  we  concentrate  on  the  approximate  calculation  of  Br  r+1  (r=l . n~\).  This 

yields  posterior  probabilities  p(G=r  lx)  directly,  as  follows.  Noting  that 

s-r 

Brs  =  n^r+f-i.^+(  (r<s)  and  Bsr  -&rsX'  we  calculate  BrSo  for  r=l, .  .  .  ,n~  1  and  some  fixed 
*=i 

s0.  Then 


n-1 


p(G=r  I  x)  =  BrSop(G  =r )/  %  BtSop(G=t), 


t=\ 


(4.2) 


where  p(G=r)  is  the  prior  probability  that  there  are  r  clusters. 

We  approximate  Br  r+ j  as  follows.  In  an  agglomerative  hierarchical  clustering  algorithm, 
the  choice  between  G=r+ 1  and  G-r  is  a  decision  whether  or  not  to  merge  two  particular 
clusters  into  one.  In  the  p  -dimensional  multivariate  normal  case,  this  is  exactly  a  standard 
comparison  of  nested  hypotheses  in  the  general  linear  model,  and  Smith  and  Spiegelhalter 
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(1980)  have  shown  that  in  that  case  minus  twice  the  logarithm  of  the  Bayes  factor  is 
approximately  equal  to 

A., -{|+log(p/i,  ,+1)}8,,  (4.3) 

where  A.,  is  the  likelihood  ratio  test  statistic,  8,  is  the  number  of  degrees  of  freedom  in  the 
asymptotic  chi-square  distribution  of  A,,  and  nr  r+ j  is  the  number  of  observations  in  the  merged 
cluster.  However,  (4.3)  is  invalid  in  the  clustering  context  because  the  regularity  conditions  on 
which  it  is  based  do  not  hold.  Wolfe  (1971)  suggested  getting  around  the  problem  by  doubling 
the  number  of  degrees  of  freedom,  and  Hemandez-Alvi  (1979)  found  that  to  be  a  reasonable 
approximation.  Aitkin,  Anderson  and  Hinde  (1981)  had  some  reservations  about  the  use  of 
Wolfe’s  (1971)  approximation  when  8,  is  large,  but  the  simulations  of  Everitt  (1981)  showed  it 
to  perform  well  for  values  of  8,  between  1  and  5,  which  is  the  range  of  primary  interest  to  us. 
We  therefore  use  the  approximation 

-21ogflr  r+1  ~kr-  {f+log ;(/m,.,+1)}  28, 

=  Er,  (4.4) 


where  8,  is  now  the  decrease  in  the  number  of  parameters  caused  by  going  from  G-r+ 1  to 
G=r. 


In  Table  2,  for  the  case  where  the  data  are  two-dimensional,  we  show  the  values  of  8,  and 
the  individual  cluster  parameters  that  must  be  estimated  for  the  clustering  criteria  from  Sections 


2  and  3.  We  write  D  = 


cosy  -siny 
siny  cosy  J 


,  where  y  is  the  orientation  of  the  cluster.  For  the  criteria 

in  Section  3,  <j>*  can  be  superefficiently  estimated,  and  so  it  is  not  included  in  the  count.  The 
term  A.,  in  (4.4)  involves  only  the  contributions  to  the  likelihood  of  the  clusters  involved  in  the 
merger.  If  we  define  the  maximized  likelihoods  for  the  two  clusters  that  are  merged  as  lk  and  lk~ 
and  the  maximized  likelihood  for  the  cluster  resulting  from  the  merger  as  /* ,  we  may  write 


\r=-2  (V  +  V- /*) 


(4.5) 


The  likelihoods  for  the  clusters  that  are  not  involved  in  the  merger  cancel  out  in  the  likelihood 
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ratio. 


Table  2 

Decrease  in  the  number  of  parameters  caused  by  reducing 
the  number  of  clusters  by  one,  for  several  criteria. 


Criterion 

5 , 

Parameters 

tr(VV) 

2 

\W\ 

2 

S 

3 

G 

2>*logtr(Hy«*) 

*=1 

3 

5* 

4 

M*  >  M'y  »  V  > 

G  Wk  , 

1  «*log|— 1 
k= 1  nk 

5 

Mx.My.  V.A*,(X2* 

Equation  (3.2) 

3 

U 

4 

\ix,[Ly,y,ct 

If  we  assume  that  the  clusters  are  embedded  in  a  Poisson  process,  the  outcomes  of  the 

mergers  are  slightly  more  complicated  since  at  each  stage  in  the  agglomerative  process  the 

number  of  clusters,  G ,  can  increase,  decrease  or  remain  the  same.  The  reason  for  this  is  that  we 

have  two  types  of  data,  clusters  and  noise.  If  we  form  a  new  cluster  by  merging  two  singletons 

that  were  considered  noise,  then  G  will  increase.  If  we  merge  a  singleton  with  an  existing 

cluster,  then  G  will  not  change.  If  two  existing  clusters  are  merged,  then  G  will  decrease.  If 

two  singletons  are  merged  to  form  cluster  k,  then  Xr  =  2 lk,  and  5r  for  the  merger  is  equal  to 

minus  the  value  of  8r  given  in  table  2.  If  a  singleton  is  merged  with  cluster  k  to  form  cluster  k 

then  Xr  =  -2 (lk-  -  lk)  and  5r  =  0  since  the  parameterization  has  not  changed  When  two  existing 
/  // 

clusters,  k  and  k  ,  are  merged  to  form  cluster  k,Xr  is  given  by  equation  (4.4)  and  8,.  is  as  given 
in  Table  2. 

Having  obtained  Br,r+i  (r  =  \, . . .  ,n-\)  from  (4.4),  we  may  calculate 

p(G-r  lx)  (r  =  I, . . .  ,n- 1)  using  (4.2).  A  simple  approach  is  to  use  as  an  estimate  of  the 
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number  of  clusters  the  value  of  r  for  which  p(G=r  I  x)  is  greatest.  However,  (4.4)  provides  a 
rather  crude  approximation  to  p(G=r  I  x).  We  therefore  prefer  to  consider  several  values  of  the 
number  of  clusters,  guided  by  the  values  of  p  (G  -r  !  x),  or,  equivalently,  by 

T  —  1 

Fr  =  £  £<  =  constant + 2  log  p  (G=r  lx).  Following  Good  (1983),  we  refer  to  Fr  as  the 

*=t 

approximate  weight  of  evidence  (AWE)  for  the  number  of  clusters  being  r .  In  our  experience, 
the  change  in  the  approximate  weight  of  evidence,  Er  =Fr+l-Fr  is  often  large  and  positive  for 
the  first  few  values  of  r,  r  =  1, ...  ,R ,  say,  and  small  or  negative  thereafter.  If  that  is  the  case, 
considerations  of  parsimony  lead  us  to  consider  G  =R+ 1,  as  well  as  the  value  of  r  which 
maximizes  the  approximate  weight  of  evidence,  and  any  intervening  values. 

5.  SIMULATION  RESULTS 

To  compare  the  performance  of  our  clustering  criteria  with  that  of  standard,  commonly 
used  criteria,  we  carried  out  a  Monte  Carlo  study.  The  standard  criteria  used  were  the  single-link 
method  (SL),  and  Ward’s  sum  of  squares  criterion,  tr(W).  These  were  compared  with  the  three 
criteria  S,  introduced  for  two  dimensions  by  Murtagh  and  Raftery  (1984)  and  generalized  in 
Section  2,  S*  defined  by  (2.5),  and  (J  defined  by  (3.3). 

To  compare  the  criteria  we  generated  100  random  samples  from  each  of  three  types  of  d  ,fa 
for  each  of  four  values  of  a,  giving  a  total  of  1200  samples.  The  three  types  of  data  correspon. 
to  the  three  models  for  which  S ,  S  *  and  U  are  optimal  criteria.  When  generating  the  data  for 
which  U  was  optimal,  <j>fc  was  generated  from  a  U(.2,  .6)  distribution  and  ak  was  proportional  to 
ouj)*.  Each  sample  consisted  of  three  clusters,  the  orientation  of  each  cluster  was  randomly 
chosen  from  a  U(0,  n)  distribution,  and  the  centers  were  randomly  chosen  in  the  unit  square. 
The  number  of  points  in  each  cluster  was  generated  from  a  discrete  uniform  distribution  on  the 
integers  between  15  and  25. 

Tables  3,  4  and  5  show  the  proportion  of  points  misclassified  by  each  of  the  five  criteria 
considered.  The  single-link  method  performed  poorly,  while  Ward’s  sum  of  squares  did  only 
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slightly  better.  The  three  criteria  5 ,  S  *  and  U  all  performed  much  better.  Of  these  three,  S  *  did 
marginally  better  than  the  others,  but  the  differences  between  them  were  small.  As  one  might 
expect,  each  of  the  three  criteria  S ,  S  *  and  U  performed  best  on  the  type  of  data  for  which  it 
was  designed,  but  it  also  performed  well  on  the  other  kinds  of  data. 

The  clear  superiority  of  5,  5*  and  U  to  the  single-link  and  Ward’s  method  held  for  each 
combination  of  the  three  kinds  of  data  with  the  four  values  of  a.  The  results  for  the  three  kinds 
of  data  were  quite  similar.  As  a  increased,  the  proportion  of  points  misclassified  by  S ,  S  *  and  U 
increased.  This  reflects  the  fact  that  as  a  increases,  the  data  generating  mechanism  more  closely 
approximates  that  for  which  tr(W)  is  the  best  criterion,  and  so  the  superiority  of  S ,  S  *  and  U 
becomes  less  marked.  Averaged  over  the  1,200  random  samples  generated,  the  proportion  of 
points  misclassified  was  16%  for  S ,  14%  for  5  * ,  15%  for  U ,  47%  for  the  single-link  method  and 
43%  for  Ward’s  sum  of  squares. 

It  is  assumed  that  some  prior  information  about  a  is  available.  This  can  come  from  a 
training  sample  or  knowledge  of  the  mechanism  generating  the  data,  for  example  the  resolution 
of  the  edge  detector  used  in  processing  a  digital  image.  Our  numerical  work,  including  the 
analysis  of  Example  3  described  in  Section  6.3,  indicates  that  our  criteria  are  not  sensitive  to 
errors  in  the  estimation  of  a.  In  the  simulation  study  the  correct  value  of  a  was  used  in  5 ,  S  * 
and  U .  This  provides  information  on  the  best  performance  that  can  be  expected. 


6.  EXAMPLES 


6.1  Example  1:  Simulated  clusters 

Figure  1(a)  shows  three  clusters  generated  from  bivariate  normal  distributions  with  the 


same  shape  but  different  sizes  and  orientations.  It  is  typical  of  the  400  random  samples  for 
which  the  results  are  summarized  in  Table  4. 
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Table  3 

Bivariate  normal  clusters  with  the  same  size  and  shape  but  different  orientations.  S  is  the 
optimal  criterion.  100  random  samples  were  generated  for  each  value  of  a.  The  entries  in  the 
table  are  the  percentages  of  points  misclassified. 


Criterion 

a 

.001 

.005 

.01 

.025 

S 

4 

13 

16 

25 

S* 

4 

16 

18 

24 

U 

7 

19 

26 

30 

SL 

51 

51 

52 

52 

tr \W) 

40 

41 

41 

40 

Table  4 

Bivariate  normal  clusters  with  the  same  shape  but  different  sizes  and  orientations.  S  *  is  the 
optimal  criterion.  100  random  samples  were  generated  for  each  value  of  a.  The  entries  in  the 
table  are  the  percentages  of  points  misclassified. 

r 


Criterion 

a 

.001 

.005 

.01 

.025 

5 

14 

17 

24 

31 

5* 

7 

12 

17 

22 

U 

7 

12 

18 

26 

SL 

46 

47 

50 

48 

tr (W) 

48 

48 

46 

46 

Figures  l(b,c,d)  show  the  results  of  grouping  the  data  into  three  clusters  using  the  criteria 
S  ,  tr(W )  and  SL  respectively.  The  S*  criterion  performed  well.  Three  of  the  four  misclassified 
points  are  within  or  close  to  the  intersections  of  the  clusters.  This  is  inevitable,  since  even  the 
human  eye,  with  its  remarkable  pattern  recognition  and  classification  abilities,  finds  it  hard  to 
classify  points  at  the  intersection  of  clusters.  Ward’s  criterion,  tr(W),  misclassified  18  of  the  45 
points  and  did  not  reproduce  the  general  shape  of  the  clusters.  As  can  be  seen  from  Figure  1(c), 
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Table  5 

Bivariate  uniform- normal  clusters.  The  observations  are  clustered  uniformly  along  and  tightly 
about  a  line  segment  in  two-dimensional  space,  as  described  in  Section  3.1.  U  is  the  optimal 
criterion.  100  random  samples  were  generated  for  each  value  of  a.  The  entries  in  the  table  are 
the  percentages  of  points  misclassified. 


Criterion 

a 

.001 

.005 

.01 

.025 

5 

4 

11 

13 

18 

S* 

5 

9 

12 

19 

U 

3 

7 

9 

14 

SL 

38 

41 

45 

43 

tr(W) 

43 

41 

44 

43 

it  tends,  instead,  to  find  "circular"  clusters.  The  single-link  method  has  been  suggested  for 
finding  long  clusters  such  as  those  in  Figure  1(a).  However,  as  can  be  seen  from  Figure  1(d)  and 
Tables  3, 4  and  5,  it  does  not  perform  well  when  the  clusters  intersect. 

Clusters  that  are  physically  separate,  in  whatever  metric  is  being  used,  are  easy  to 
distinguish  with  most  clustering  criteria.  The  clusters  we  have  been  working  with  are 
distinguished  from  each  other  by  their  structure.  A  point  within  one  cluster  may  be  closer,  in 
Euclidean  distance,  to  points  in  other  clusters  than  to  any  point  in  the  cluster  to  which  it  belongs, 
yet  we  are  able  to  classify  it  correctly  due  to  the  structure  of  the  clusters.  For  example,  consider 
Figure  1(b).  Note  the  two  points  on  the  left  that  have  been  correctly  classified  as  belonging  to 
cluster  2  (triangles)  yet  they  are  closer  to  points  in  cluster  3  (diamonds)  than  to  any  point  in 
cluster  2.  Criteria  based  strictly  on  distance  measures,  such  as  the  single  link  method,  are  unable 
to  handle  clusters  that  are  defined  by  their  structure. 
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6.2  Example  2:  Simulated  clusters  with  noise 

Figure  2  shows  three  clusters  with  added  noise.  The  clusters  were  generated  from  bivariate 
normal  distributions  with  the  same  shape  but  different  sizes  and  orientations  while  the  noise  was 
generated  by  a  Poisson  process.  This  example  differs  from  Example  1  in  that  noise  has  been 
added  and  that  we  do  not  assume  the  numbers  of  clusters  to  be  known  in  advance. 

After  clustering  the  data  in  Figure  2  using  S  *  in  a  hierarchical  agglomeration  procedure, 
the  approximate  weight  of  evidence  (AWE)  was  calculated  at  each  iteration,  as  shown  in  Figure 
3.  The  AWE  is  maximized  at  iteration  47  and  falls  off  sharply  after  that,  indicating  that  the 
clustering  algorithm  should  be  stopped  at  the  47th  iteration.  Figure  4  shows  the  results  at 
iteration  47  after  using  an  iterative  relocation  algorithm  to  improve  upon  the  original 
agglomerative  results.  The  three  main  clusters  are  well-defined  with  one  misclassification,  and 
only  one  of  the  noise  points  has  been  misclassified. 

6.3  Example  3:  Diabetes  data 

Reaven  and  Miller  (1979)  described  and  analyzed  data  consisting  of  the  area  under  a 
plasma  glucose  curve  (Glucose  Area),  the  area  under  a  plasma  insulin  curve  (Insulin  Area)  and 
steady  state  plasma  glucose  response  (SSPG)  for  145  subjects.  The  subjects  were  clinically 
classified  into  three  groups,  chemical  diabetes,  overt  diabetes  and  normal  (non-diabetic). 
Symons  (1981)  reanalyzed  the  data  using  seven  different  clustering  criteria.  Taking  the  clinical 
classification  to  be  correct,  we  evaluate  one  of  our  criteria  and  compare  it  with  those  considered 
by  Symons  (1981),  using  the  data  as  published  in  Andrews  and  Herzberg  (1985). 

Reaven  and  Miller  (1979,  Figures  1-4)  showed  four  two-dimensional  projections  of  the 
data.  The  data  have  the  3-dimensional  shape  of  a  boomerang  with  two  wings  and  a  fat  middle. 
One  of  the  wings  corresponds  to  patients  with  oven  diabetes,  the  other  wing  is  composed 
primarily  of  patients  with  chemical  diabetes  and  the  "fat  middle"  is  composed  of  normal 
patients.  By  viewing  the  data  using  a  rotating  3-dimensional  scatterplot,  such  as  the  ones 
provided  in  MacSpin  (Donoho,  Donoho  and  Gasko,  1988)  or  XLISP-STAT  (Tierney,  1988),  this 
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Figure  2.  Three  clusters  with  noise.  The  clusters  were  generated  from  bivariate  normal  distri¬ 
butions  with  the  same  shape  but  different  sizes  and  orientations.  The  noise  was  generated 
from  a  Poisson  process. 


AWE 


Figure  3.  Approximate  weight  of  evidence  (AWE)  for  the  number  of  clusters  in  Figure  2 
using  the  criterion  S * .  The  maximum  occurs  at  iteration  47  and  leads  to  the  clusters  shown 
in  Figure  4. 
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Figure  4.  The  clusters  resulting  from  the  data  in  Figure  2  using  the  criterion  S  *  and  stopping 
at  the  47th  iteration  as  indicated  by  the  AWE  in  Figure  3. 
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structure  is  obvious  and  several  other  features  become  apparent.  One  of  the  "wings"  is  almost 
planar,  the  other  is  linear  with  some  curvature,  and  the  "fat  middle"  has  a  shape  similar  to  an 
American  football.  Four  two-dimensional  projections  of  the  data  are  shown  in  Figure  5. 

Based  on  this  information,  we  could  use  the  approach  developed  in  Sections  2  and  3  to 
design  a  purpose-built  clustering  criterion  for  this  application.  However,  we  prefer  to  use  a  very 
general  criterion  of  the  form  S* ,  where  Ak  =diag{  l,a,  a).  This  criterion  is  optimal  for  trivariate 
normal  clusters  with  different  sizes  and  orientations  but  the  same  "tubular"  shape,  clustered 
circularly  about  a  line  in  R3.  The  estimated  values  of  a  for  the  three  clinically  identified  groups 
are  .09,  .19  and  .34.  The  results  were  relatively  insensitive  to  changes  in  a  so  long  as  it 
remained  in  that  broad  range.  The  results  we  report  are  for  a  =  .2. 

Starting  from  the  correct  clinical  classification  and  using  a  single  point  iterative  relocation 
algorithm  with  the  criterion  S* ,  the  optimal  classification,  as  given  in  Table  6,  resulted  in  only 
10%  of  the  points  being  misclassified.  This  compares  favorably  with  the  results  given  by  all 
seven  clustering  criteria  used  by  Symons  (1981)  for  this  data  set.  We  also  used  a  hierarchical 
agglomerative  clustering  algorithm  followed  by  iterative  relocation.  Once  again,  the  results 
compare  favorably  with  those  of  Symons  (1981).  The  clusters  found  are  shown  in  Figure  6. 

The  AWE  for  the  hierarchical  agglomerative  clustering  algorithm  increased  steadily  until 
the  final  5  iterations.  Figure  7  shows  the  number  of  clusters  versus  the  AWE  over  the  last  20 
iterations.  From  this  it  can  be  seen  that  the  AWE  increases  sharply  as  one  goes  from  one  cluster 
to  two,  and  again  from  two  to  three.  It  increases  slightly  as  the  number  of  clusters  goes  up  to  four 
and  five,  and  decreases  thereafter.  If  we  did  not  know  the  true  number  of  clusters  this  would 
lead  us  to  focus  attention  on  the  groupings  into  three,  four  and  five  clusters,  and  to  perform  a 
more  detailed  analysis  on  these  sets  of  clusters. 
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(c)  <d) 

Glucose  Area  Projections  as  in  R&M 


Figure  5.  Four  two-dimensional  projections  of  the  three-dimensional  diabetes  data  of  Reaven 
and  Miller  (1979).  The  symbols  indicate  the  clinical  classification  of  subjects  as  having 
chemical  diabetes,  overt  diabetes  or  being  normal,  (d)  Shows  the  approximate  projections 
represented  by  the  artist’s  sketch  in  Reaven  and  Miller  (1979)  and  reproduced  in  Symons 
(1981). 


Insulin  Area 


-  24  - 


Glucose  Area 


Figure  6.  The  three  clusters  in  the  diabetes  data  found  by  hierarchical  agglomeration  fol¬ 
lowed  by  iterative  relocation  using  the  criterion  5*  with  Ak  =diag{  1.  .2,  .2).  The  two- 
dimensional  projection  shown  is  that  of  Figure  5(c).  The  symbols  indicate  the  classification 
of  the  subjects  based  on  the  clustering  algorithm.  The  filled-in  symbols  represent  subjects 
whose  clustering  classification  differs  from  the  clinical  classification. 
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Figure  7.  Approximate  weight  of  evidence  (AWE)  for  the  number  of  clusters  in  the  diabetes 
data  over  the  last  20  iterations  of  the  clustering  algorithm.  The  AWE  increases  sharply  up  to 
three  clusters,  with  further  slight  increases  up  to  five  clusters,  and  decreases  thereafter.  This 
would  lead  us  to  focus  on  the  groupings  into  three,  four  and  five  clusters. 
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Table  6 

Results  of  clustering  the  diabetes  data.  The  first  row  shows  the  result  of  single  point  iterative 
relocation  using  the  criterion  S  with  Ak  =diag{  1,  .2,  .2),  starting  with  the  clinical 
classification.  The  second  row  shows  the  result  of  hierarchical  agglomeration  followed  by 
iterative  relocation  with  the  same  criterion.  The  remaining  rows  show  the  results  of  seven  other 
clustering  procedures,  starting  at  the  clinical  classification,  as  reported  by  Symons  (1981). 
Criterion  (13)  of  Symons  ( 1981 )  is  due  to  Maronna  and  Jacovkis  (1974).  The  error  rate  %  is  the 
percentage  of  the  subjects  who  were  not  classified  in  the  same  way  by  the  clustering  method  as 
by  the  clinical  diagnosis. 


Error 

Clinical  classification 

Method 

rate  % 

Normal 

(76,0,0) 

Chemical 

(0,36,0) 

Overt 

(0,0,33) 

S*  from  clinical 

10 

(65,0,0) 

(11,36,4) 

(0,0,29) 

S*  agglomerative 

10 

(65,0,0) 

(11,36,4) 

(0,0,29) 

\W\ 

19 

(73,17,3) 

(3,19,4) 

(0,0,26) 

Reaven  and  Miller  (1979) 
variant  of  1  W\ 

14 

(73,10,1) 

(3,26,6) 

(0,0,26) 

(8)  in  Symons  (1981) 

26 

(75,30,6) 

(1,6,1) 

(0,0,26) 

(10)  in  Symons  (1981) 

26 

(75,30,6) 

(1,6,1) 

(0,0,26) 

(13)  in  Symons  (1981) 

13 

(73,10,0) 

(3,26,7) 

(0,0,26) 

(1 1)  in  Symons  (1981) 

14 

(63,0,0) 

(13,30,2) 

(0,6,31) 

(12)  in  Symons  (1981) 

13 

(73,9,0) 

(3,27,7) 

(0,0,26) 

7.  DISCUSSION 

We  have  proposed  ways  of  overcoming  some  of  the  limitations  of  the  classification 
maximum  likelihood  procedure  for  cluster  analysis,  as  currently  implemented.  These  are  (1)  the 
inability  to  specify  some  but  not  all  features  (orientation,  size,  shape)  to  be  constant  across 
clusters;  (2)  the  restriction  to  normal  distributions;  and  (3)  the  failure  to  account  for  "noise".  We 
have  also  proposed  an  approximate  Bayesian  solution  to  the  problem  of  choosing  the  number  of 
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clusters,  which  seems  to  avoid  some  of  the  difficulties  associated  with  solutions  to  this  problem 
based  on  significance  testing. 

In  the  context  of  Gaussian  clustering,  we  reparameterize  the  covariance  matrices  in  terms 
of  their  eigenvalue  decompositions.  Each  group  of  parameters  then  corresponds  clearly  to  a 
particular  feature  of  the  cluster  (orientation,  size  or  shape),  and  criteria  appropriate  for  a  range  of 
different  situations  result  by  constraining  none,  some  or  all  of  these  features  to  be  constant 
across  clusters.  This  leads  to  a  range  of  criteria  which  are  more  general  than  that  of  Friedman 
and  Rubin  (1967)  and  more  parsimonious  than  that  of  Scott  and  Symons  (1971)  for  the  unequal 
covariance  case.  The  reparameterization  of  covariance  matrices  in  terms  of  the  eigenvalue 
decomposition  has  also  been  considered  by  Flury  (1988)  although  he  did  not  view  it  in  the 
context  of  cluster  analysis  and  he  assumed  the  eigenvector  matrices,  Dk ,  to  be  the  same  across 
all  groups. 

A  general  and  practical  approach  to  non-Gaussian  clustering  is  introduced.  It  is  developed 
in  detail  for  the  important  special  case  where  points  are  distributed  uniformly  along  and  tightly 
about  a  line  segment  in  p  -space.  "Noise"  is  allowed  for  by  permitting  isolated  observations  to 
be  distributed  over  the  data  region  according  to  a  Poisson  process.  We  propose  an  approximate 
Bayesian  method  for  choosing  the  number  of  clusters.  We  also  write  down  the  exact  Bayesian 
solution,  which  is  optimal  given  the  model,  but  is  usually  not  computable;  our  approximation 
seems  to  perform  well  in  numerical  examples. 

An  alternative  specification  of  the  model  (1.1),  which  leads  to  the  so-called  mixture 
maximum  likelihood  approach,  has  been  considered  by  Wolfe  (1970),  Symons  (1981), 
McLachlan  (1982)  and  McLachlan  and  Basford  (1988).  This  assumes  that  x  is  a  random  sample 
from  a  mixture  of  the  G  densities  /*(x;0)  (k=\, . . .  ,G)  in  the  proportions  e  =  (e1( . . .  ,ef;)7 . 
Then  0  and  e  are  estimated,  and  conditional  probabilities  /?(y,=&  I  x,  0,e)  are  calculated. 
Marriott  (1975)  and  Bryant  and  Williamson  (1978)  showed  that  when,  unlike  here,  estimation  of 
0  is  of  primary  interest,  then  the  classification  maximum  likelihood  method  is  inconsistent. 
However,  when  the  covariance  matrices  are  unequal,  the  mixture  maximum  likelihood  approach 
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appears  to  break  down  in  practice  (Day  1969).  McLachlan  and  Basford  (1988,  Section  2.1) 
discuss  some  theoretical  results  which  suggest  that  it  may  be  possible  to  apply  the  mixture 
maximum  likelihood  approach  when  the  covariance  matrices  are  unequal,  but  this  does  not  seem 
to  have  been  done  yet.  If  it  could  be  done,  it  seems  likely  that  the  methods  proposed  in  this  paper 
could  also  be  extended  to  the  mixture  maximum  likelihood  approach  using  the  EM  algorithm 
(McLachlan  and  Basford,  1988,  Section  1.6). 

The  classification  and  mixture  maximum  likelihood  approaches  are  in  conflict  only  when 
the  primary  aim  is  to  estimate  0;  the  conflict  is  resolved  when,  as  here,  the  aim  is  to  estimate  y, 
and  0  is  a  nuisance  parameter.  This  is  easiest  to  see  in  a  Bayesian  framework,  where  the  full 
solution  is  the  posterior  distribution  p  (y  I  x).  It  follows  from  equation  (2  2)  of  Binder  (1978)  that 
this  is  the  same  under  the  two  models  when  the  prior  for  yin  (1.1)  is  hierarchical  and  compatible 
with  the  prior  for  e  in  the  mixture  model.  Thus  the  classification  maximum  likelihood  solution  y 
may  be  viewed  as  a  approximation  to  the  posterior  mode  of  y  under  both  models. 
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The  performance  of  the  proposed  methods  is  studied  by 
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